Abstract


Access to web-scale corpora is gradually bringing robust automatic
knowledge base creation and extension within reach. To exploit these
large unannotated (and extremely difficult to annotate) corpora,
unsupervised machine learning methods are required. Probabilistic
models of text have recently found some success as such a tool, but
scalability remains an obstacle in their application, with standard
approaches relying on sampling schemes that are known to be difficult
to scale. In this report, we therefore present an empirical assessment
of the sublinear time sparse stochastic variational inference (SSVI)
scheme applied to RelLDA. We demonstrate that online inference leads to
relatively strong qualitative results but also identify some of its
pathologies, and those of the model, which will need to be overcome if
SSVI is to be used for large-scale relation extraction.


1 Introduction 


Access to web-scale corpora is gradually bringing automatic knowledge
base creation and extension within reach (Mausam et al., 2012).
Human-curated resources, such as Freebase (Bollacker et al., 2008), are
invaluable for relation extraction, but they are inherently incomplete.
The total number of relations that might be encountered is unbounded
and the number actually encountered in a corpus grows with its size.
Hence the need for unsupervised methods and the recent small-scale
success on this problem with probabilistic models. Unfortunately,
prohibitive memory usage and training time make their large-scale
application all but impossible, and the incremental training algorithms
used to train topic models like latent Dirichlet allocation (LDA) at
scale (Hoffman et al., 2010; Mimno et al., 2012) have not yet been
applied to relation extraction.

* Work undertaken while the second author was at Xerox Research Centre
Europe, supervising the first author's research internship.

In this paper, we show that sparse stochastic variational inference
(SSVI) (Mimno et al., 2012) can be applied to the RelLDA model for
unsupervised relation extraction introduced by (Yao et al., 2011; Yao
et al., 2012). SSVI is attractive for two reasons. First, it processes
corpora incrementally, speeding convergence and supporting streaming.
Second, it improves on plain stochastic variational inference by using
sparse updates able to deal with a large number of topics. We find that
our algorithm is able to obtain strong qualitative results in a
fraction of the time that is needed to run the Gibbs sampler for RelLDA
and with a reduced memory footprint. We also include discussion of some
pitfalls in unsupervised relation extraction with LDA-style models and
how they might be overcome, and we show that dependency parse features
are not needed for this task, a major departure from prior work in this
area.

2 Model Specification 

We use a modified form of RelLDA (Yao et al., 2011), eliminating the
reliance on a dependency parsed corpus. Relations are grouped into
clusters. Each document is assumed to behave as a mixture of these
relation clusters, with each sentence in the document exhibiting
exactly one of them. Multiple feature sets are permitted, which we
exploit below to use separate vocabularies for entity features, linking
word features, and syntactic features. Throughout this paper, we adopt
the convention that $R$ refers to the number of relation clusters, $F$
to the number of feature types, $W_f$ to the vocabulary size for
feature type $f$ ($1 \le f \le F$), $N_d$ to the number of sentences in
a document, and $N_{dif}$ to the number of features of type $f$
exhibited by sentence $i$ in document $d$.

In this notation, the relation clusters are defined as a set of $F$
discrete distributions over the feature vocabularies:

For $r = 1, \ldots, R$ and $f = 1, \ldots, F$:

Draw $\beta_{rf} \sim \mathrm{Dirichlet}(\eta_f)$,

where $\eta_f > 0$ is a scalar.[1] The generative process for relations
takes the following form:

For $d = 1, \ldots, D$:

1. Draw $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.

2. For $i = 1, \ldots, N_d$:

(a) Draw $z_{di} \sim \theta_d$.

(b) For $f = 1, \ldots, F$ and $j = 1, \ldots, N_{dif}$: Draw
$w_{dij} \sim \beta_{z_{di} f}$.

where $\alpha > 0$ is again a scalar, and $\theta_d$ defines a discrete
distribution over the relation clusters associated to document $d$. The
relation in each sentence is drawn from $\theta_d$ and the associated
features from $\beta_{rf}$.
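As a rough illustration, the generative process can be simulated in a few lines of Python. This is only a sketch under simplifying assumptions that are not part of the model description: every document has the same number of sentences (`sents_per_doc`), every sentence has the same number of features of each type (`feats_per_type`), and all function and argument names are hypothetical.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_discrete(probs, rng):
    """Draw an index from a discrete distribution given as a list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(D, R, vocab_sizes, eta, alpha,
                    sents_per_doc, feats_per_type, seed=0):
    """Simulate the generative process with fixed sentence/feature counts.

    vocab_sizes[f] plays the role of W_f and eta[f] of eta_f.
    Returns (beta, corpus); each sentence is (relation, features_by_type).
    """
    rng = random.Random(seed)
    F = len(vocab_sizes)
    # beta[r][f] is the distribution of relation r over vocabulary f.
    beta = [[sample_dirichlet(eta[f], vocab_sizes[f], rng)
             for f in range(F)] for _ in range(R)]
    corpus = []
    for _ in range(D):
        theta = sample_dirichlet(alpha, R, rng)  # document mixture
        doc = []
        for _ in range(sents_per_doc):
            z = sample_discrete(theta, rng)      # one relation per sentence
            feats = [[sample_discrete(beta[z][f], rng)
                      for _ in range(feats_per_type)] for f in range(F)]
            doc.append((z, feats))
        corpus.append(doc)
    return beta, corpus
```

For example, `generate_corpus(D=3, R=4, vocab_sizes=[5, 7], eta=[0.5, 0.5], alpha=0.5, sents_per_doc=2, feats_per_type=3)` yields three documents of two sentences each, where all features in a sentence are drawn from the distributions of that sentence's single relation.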

2.1 Extracting entity pairs 

We assume access to a part-of-speech (POS) tagger and a named entity
recognizer (NER). Our ultimate goal is to extract relations between
named entities and therefore necessarily limit attention to sentences
with at least two entity mentions. Sentences with more than two
mentions pose a problem due to a priori ambiguity in the pairs being
related, so we simply assume the salient entity pair is the one that is
closest together in the sentence, a simple heuristic that allows us to
avoid modeling sentence segmentation. We use the Stanford CoreNLP
library for both POS tagging and NER (Finkel et al., 2005).
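The closest-pair heuristic can be sketched as follows. The text does not say how distance between mentions is measured, so this sketch assumes it is the number of tokens separating two mentions, assumes mentions do not overlap, and breaks ties in favor of the leftmost pair; the function name and the span representation are hypothetical.

```python
def closest_entity_pair(mentions):
    """Pick the two mentions separated by the fewest tokens.

    mentions: list of (start, end) token spans, end exclusive,
              assumed non-overlapping.
    Returns a pair of indices into `mentions` (left, right), or None
    if the sentence has fewer than two mentions.
    """
    if len(mentions) < 2:
        return None
    # With non-overlapping spans, the closest pair is always adjacent
    # once the mentions are ordered by position.
    order = sorted(range(len(mentions)), key=lambda k: mentions[k][0])
    best_pair, best_gap = None, None
    for left, right in zip(order, order[1:]):
        gap = mentions[right][0] - mentions[left][1]
        if best_gap is None or gap < best_gap:
            best_pair, best_gap = (left, right), gap
    return best_pair
```

With mentions at token spans `(0, 2)`, `(5, 6)`, and `(7, 9)`, the gaps are 3 and 1 tokens, so the second and third mentions are selected.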

2.2 Feature sets 

Our experiments all draw on feature types built according to a small
set of templates and always reflecting only the sequence of words
between two selected entity mentions in the sentence:

Entity surface strings. Each sentence contains two distinguished entity
mentions. The left (first) and right (second) strings are treated as
features of distinct types to capture asymmetry. The vocabularies for
those two types are, however, the same. We refer to the resultant
features as ENT^left and ENT^right.

[1] Note that this means we are using a symmetric Dirichlet, viz.
$p(\beta \mid \eta) \propto \prod_v \beta_v^{\eta - 1}$.


Entity types. The Stanford NER outputs entity types in {PER, ORG, LOC,
MISC}, referring to person, organization, location, and miscellaneous,
respectively. We use the pair $(t_1, t_2)$ of entity types for the two
distinguished entities as a feature. This feature type is referred to
as ENT-TYPE.

Phrases between the entities. The word sequence between the entities is
partitioned into coarse-grained part-of-speech categories: ADJ (JJ,
JJR, JJS), ADV (RB, RBR, RBS), NN (NN, NNS, NNP, NNPS, PRP, WP), PP
(IN, TO), VB (VB, VBD, VBG, VBN, VBP, VBZ), and OTH (everything else).
We refer to the resultant six feature sets as ADJ, ADV, NN, OTH, PP,
and VB.


POS tag sequences. We include a feature corresponding to the entire
sequence of Penn Treebank POS tags between the two entities. We refer
to this feature type as POS-SEQ.
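The phrase and POS-sequence templates can be illustrated with a short sketch. The tag-to-category table comes from the list above, but the way phrases are segmented is an assumption: here each maximal run of consecutive words sharing a coarse category becomes one phrase feature, which may differ from the segmentation actually used. The function name is hypothetical.

```python
from itertools import groupby

# Coarse categories and the Penn Treebank tags they cover (from the text).
COARSE_POS = {
    "ADJ": {"JJ", "JJR", "JJS"},
    "ADV": {"RB", "RBR", "RBS"},
    "NN": {"NN", "NNS", "NNP", "NNPS", "PRP", "WP"},
    "PP": {"IN", "TO"},
    "VB": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
}
TAG_TO_COARSE = {tag: name for name, tags in COARSE_POS.items() for tag in tags}


def between_features(tagged_tokens):
    """Build phrase and POS-SEQ features for the words between two entities.

    tagged_tokens: list of (word, penn_tag) pairs.
    Assumption: each maximal run of one coarse category is one phrase.
    """
    features = {name: [] for name in list(COARSE_POS) + ["OTH"]}
    coarse = [(word, TAG_TO_COARSE.get(tag, "OTH")) for word, tag in tagged_tokens]
    for category, run in groupby(coarse, key=lambda pair: pair[1]):
        features[category].append(" ".join(word for word, _ in run))
    # The full tag sequence is a single feature value of type POS-SEQ.
    features["POS-SEQ"] = [" ".join(tag for _, tag in tagged_tokens)]
    return features
```

For the tagged words `the/DT chief/JJ market/NN strategist/NN at/IN`, this produces the NN phrase `market strategist`, the ADJ phrase `chief`, the PP phrase `at`, the OTH phrase `the`, and the POS-SEQ value `DT JJ NN NN IN`.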


3 Sparse Stochastic Variational Inference 


To make inference scalable to very large corpora, we use the sparse
stochastic variational inference (SSVI) originally developed for LDA
(Mimno et al., 2012). The true posterior over $\beta$ is approximated
by a product of independent Dirichlets, viz.

$$q(\beta_{1:R,1:F}) = \prod_{r=1}^{R} \prod_{f=1}^{F} q(\beta_{rf}),$$

where $q(\beta_{rf}) = \mathrm{Dirichlet}(\lambda_{rf})$ and
$\lambda_{rf} \in \mathbb{R}_+^{W_f}$ are variational parameters.
Classical variational Bayes would also approximate the posterior over
$\theta_d$ and $z_{di}$ by Dirichlet and multinomial distributions,
respectively, leading to $O(DR)$ memory usage and $O(R)$ time for local
updates. SSVI reduces both requirements to $O(1)$ by eliminating the
local variational distribution. Instead, it integrates out $\theta_d$
and uses samples from an optimized variational distribution $q^*(z_d)$
to estimate the expectations required in the updates. Here the
optimality criterion for $q^*$ is simply that its Kullback-Leibler
divergence from the true $z_d$ posterior is as small as possible within
the constraints imposed by its factored form (Bishop, 2006).











Furthermore, the entire corpus need not be considered during each step;
rather, a random minibatch $B = \{d_1, \ldots, d_S\}$ of documents is
considered and sampling is carried out only for those documents. Each
iteration thus only needs to update the parameters associated with
relations $r$, feature types $f$, and feature values $v$ encountered in
$B$. This leads to the following variational updates:

$$\lambda_{rfv}^{(t)} = (1 - \rho_t)\,\lambda_{rfv}^{(t-1)} + \rho_t \Big( \eta_f + \frac{D}{S} \sum_{d \in B} \hat{E}[N_{drfv}] \Big),$$

where $\rho_t$ is the learning rate, $N_{drfv}$ is the number of times
feature value $v$ of type $f$ is assigned to relation $r$ in document
$d$, and $\hat{E}$ denotes a Monte Carlo estimate of an expectation.
Using a trick we explain in the supplement, we can ensure that each
iteration only updates parameters $\lambda_{rfv}$ for relations $r$,
feature types $f$, and feature values $v$ that occur in that
iteration's minibatch (the origin of the sparse moniker). The
supplement likewise explains our natural gradient hyperparameter
optimization scheme for $\eta_f$ and $\alpha$.
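To make the update concrete, the sketch below applies one stochastic step in the standard stochastic variational inference form, $\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t(\eta_f + \tfrac{D}{S}\hat{N})$, which is what the natural-gradient step in the supplement works out to. It is a dense illustration only: it touches every parameter rather than using the sparse bookkeeping, and the dictionary layout and function name are assumptions.

```python
def svi_update(lam, eta, minibatch_counts, rho, D, S):
    """One dense stochastic variational update of lambda.

    lam:              dict mapping (r, f, v) -> current lambda_{rfv}
    eta:              list with eta[f] for each feature type f
    minibatch_counts: dict mapping (r, f, v) -> Monte Carlo estimate of
                      the expected count summed over the minibatch
    rho:              learning rate for this iteration
    D, S:             corpus size and minibatch size
    """
    updated = {}
    for (r, f, v), old in lam.items():
        # Rescale the minibatch estimate to the size of the full corpus.
        corpus_estimate = (D / S) * minibatch_counts.get((r, f, v), 0.0)
        updated[(r, f, v)] = (1.0 - rho) * old + rho * (eta[f] + corpus_estimate)
    return updated
```

Entries that never occur in the minibatch simply decay toward the prior value $\eta_f$, which is the behavior the sparse scheme reproduces without visiting them.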

4 Empirical Evaluation 

4.1 Datasets 

We use the Aquaint2 corpus, consisting of articles from several
newspapers including the New York Times (Vorhees and Graff, 2008).
After eliminating sentences with fewer than two entities, we were left
with 578790 documents (1492599 sentences), of which 462755 (1193275
sentences) were used in training and the remainder used for evaluation.
The sizes of the feature sets for this data were: 8996 (ADJ), 7334
(ADV), 233725 (ENT^left), 233725 (ENT^right), 39395 (NN), 52998 (OTH),
16564 (PP), 28826 (VB), 89022 (POS-SEQ), and 16 (ENT-TYPE). We consider
two subsets of the features in our experiments:

1. The full feature set: ADJ, ADV, ENT^left, ENT^right, NN, OTH, PP,
VB, POS-SEQ, and ENT-TYPE.

2. All features excluding the entity features: 
ADJ , ADV , OTH , PP , VB , POS-SEQ , and 
ENT-TYPE . 

4.2 Model selection 

The hyperparameters are optimized as part of the algorithm. SSVI
includes a learning rate $\rho_t$, generally set to

$$\rho_t = \frac{a}{(b + t)^c},$$

where $a, b > 0$ and $1/2 < c \le 1$. This choice of schedule allows
convergence of the algorithm to a local optimum of the objective to be
guaranteed (Hoffman et al., 2013). In practice, setting $c$ at or close
to $1/2$ gives good results.
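A schedule of the form $\rho_t = a / (b + t)^c$ is easy to tabulate, which helps when choosing $a$ and $b$. The following sketch assumes that form; the function name and the default exponent are illustrative choices, not values taken from the experiments.

```python
def learning_rate(t, a, b, c=0.6):
    """Polynomially decaying step size rho_t = a / (b + t) ** c.

    a and b must be positive; the usual convergence conditions for
    stochastic approximation hold for exponents c in (1/2, 1].
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a / (b + t) ** c
```

Larger `b` flattens the early part of the schedule, while `a` scales every step; with `a = 0.01`, `b = 10.0`, and `c = 1.0` the first step size is 0.001.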

We fit the model with several values of $R$, $a$, and $b$ and score
each based on its perplexity and variational objective values on an
evaluation corpus. We carried out a grid search for values with
$R \in \{250, 500, 1000\}$, $a \in \{0.1, 0.01, 0.001\}$, and
$b \in \{1.0, 10.0\}$. We find that the choice of these parameters has
a noticeable but not substantial effect on the metrics. Nonetheless, we
limited our qualitative evaluation to the best learning rates in terms
of the variational objective (-9.02 x 10®), that is, $a = 0.01$,
$b = 10.0$, and $R = 500$. The number of iterations $T$ of SSVI, on the
other hand, had a substantial effect. Figure 1 illustrates this with
varying values of $R$ and $a = 0.1$, $b = 1.0$.

4.3 Discovered relations 

Evaluating the quality of the relations discovered by our algorithm is
challenging in the absence of ground truth, especially due to the
inherent noisiness of relation clusters discovered by any unsupervised
learning algorithm, and by stochastic gradient methods in particular.
Ordinarily, the output of LDA-type models is shown as per-topic
rankings of the vocabulary. In our setting, this makes little sense due
to the multi-view setup and the fact that, e.g., the most likely
entities under a relation need not correspond to the most likely noun
phrases. We thus represent relation clusters as lists of sentences most
strongly associated with them. The strength of association was
determined by taking 50 posterior samples of the relation assignment
for each sentence and computing the proportion of samples assigned to
each relation.
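The association score is just an empirical frequency over sampled assignments. The sketch below shows the computation and the resulting ranking; the input layout (a mapping from sentence identifiers to lists of sampled relation indices) and the function names are assumptions made for illustration.

```python
from collections import Counter


def association_strengths(sampled_relations):
    """Proportion of posterior samples assigned to each relation.

    sampled_relations: list of relation indices, one per posterior
                       sample, for a single sentence.
    """
    total = len(sampled_relations)
    return {r: n / total for r, n in Counter(sampled_relations).items()}


def top_sentences(samples_by_sentence, relation, k):
    """Rank sentences by how often they were assigned to `relation`."""
    scored = [(association_strengths(samples).get(relation, 0.0), sentence_id)
              for sentence_id, samples in samples_by_sentence.items()]
    # Highest proportion first; ties broken by sentence identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentence_id for _, sentence_id in scored[:k]]
```

A sentence assigned to a relation in 40 of its 50 samples receives strength 0.8 for that relation and ranks below a sentence assigned to it in all 50.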

As Table 1 shows, the clusters are reasonably coherent but quite noisy.
The first corresponds to a general constellation of relations between
people and organizations that could reasonably be summarized as
“occupies leadership position at,” though in reality, the
generalization made by the inference procedure is somewhat narrower
than that, with a bias toward political leaders. The second is much
more restricted and basically corresponds to the concept of being a
“market strategist








Figure 1: (Left) Perplexity on an evaluation corpus for SSVI as a function of iteration ($a = 0.01$, $b = 1.0$).
(Right) A comparison of evaluation perplexity for SSVI and Gibbs sampling with $R = 1000$.


leader-at relation 

European / Peter Mandelson / trade commissioner / NN NN OT / MISC-PER 
UN / Joao Bernardo Honwana / special envoy / NN NN / ORG-PER 
UN / Pierre Goldschmidt / director general /’s deputy / ORG-PER 
Zimbabwe / Morgan Tsvangirai / opposition leader / NN NN / LOC-ORG 
Spanish / Jose Antonio Alonso / counterpart, / NN OT / MISC-PER 
European Union / Pascal Lamy / trade commissioner / NN NN / ORG-PER 
ASIO / Dennis Richardson / director general / OT NN JJ / ORG-PER 
WTO / EU / trade commissioner / NN NN OT / MISC-PER 
UN / Jacques Klein / special envoy / NN NN / ORG-PER 
pro-Russian / Viktor Yanukovich / opposition leader / NN NN / MISC-PER 


trader-at/market-strategist-at relation 
Roma / Livorno / bottom club / 2-0 away to / VBD CD RB TO NN NN / ORG-LOC 
Art Hogan / Jefferies and Co. / market strategist at /, chief / OT JJ NN NN IN / PER-ORG 
Kenneth Tower / CyberTrader / market strategist at /, chief / OT JJ NN NN IN / PER-ORG 
Chinese / Ssangyong / bidder for / firm , / as the / NN OT VBD IN DT JJ NN IN / MISC-ORG 
Michael Sheldon / Spencer Clark LLC / market strategist at /, chief / OT JJ NN NN IN / PER-ORG 
Oracle Corp. / PeopleSoft / business software maker / bid for / ORG-ORG 
US / Asian / market strategist at /, chief / OT JJ NN NN IN / PER-ORG 
SAIC / Birmingham-based / automaker , / fortunes of / one billion / that could potentially /1.85 billion / ORG-MISC 
Barry Ritholtz / Maxim Group / market strategist at / , chief / OT JJ NN NN IN / PER-ORG 
Al Goldman / AG Edwards / market strategist at /, chief / OT JJ NN NN IN / PER-ORG


Table 1: Sentences in the corpus most strongly associated with one of the relations, as determined by
sampling relation assignments. (Top) This particular relation appears to identify the concept of “occupies
leadership position at,” while the second relation (bottom) appears to identify the concept of “trader at”
or “trading strategist at.” Parameters were set to $R = 500$, $a = 0.009$, and $b = 10.0$ for both.


at.” Even so, the model picks up on the fact that “bidder for” is a
closely related concept and expresses a similar relationship between
the person and organization in question.

4.4 Clustering pathologies 

The results we show correspond to a feature set excluding ENT^left and
ENT^right, as we found certain pathologies in the output with the full
feature set, notably a tendency for some relation clusters to form
around sets of entities rather than the relations between them.
Figure 2 illustrates this effect.

Removing entity features resolves this first issue. Overcoarsening of
relation clusters is a more persistent problem. Some relations look
more like broad topics than focused relations. The likely cause of this
is allocation of topic words that co-occur with relation words to that
relation due to the absence of a special set of shared topic
distributions that could catch the intruding words. Figure 2
illustrates this problem within one relation.


The incorporation of syntactic features is another source of
over-coarsening, with some relation clusters forming around common
syntactic patterns. The best illustration of this is the pair of POS
tag sequences “NNS IN” and “NN IN”, which served as the basis for
clustering of unrelated concepts like “headquarters in,” “crisis in,”
and “meeting in.”

At their core, the pathologies we uncover all appear to flow from
problems in the model rather than the inference scheme, notably the
requirement that each word be explained by a relation. The absence of
any broader shared topic distributions that can be used to explain away
non-relation-specific words causes some relations to behave very much
like topics and all relations to catch many co-occurrent words that are
not essentially part of their semantic content. Likewise, although the
addition of syntactic features allows abstraction away from specific
word patterns to more generally applicable syntactic ones, it also













Entity-based relation

French Riviera / Cannes / resort of
Northern Gaza / Israeli / withdrawal of the / and the
Gaza / Israeli / withdrawal of the / and the
France / China / deficit with

Topic-like relation

policemen and / city of / near the / were killed
blows himself / suicide bomber / up during / when a
incursion into / the northern
rebel stronghold of / roadside bomb / were wounded


Figure 2: (Left) A relation based on sets of related entities. (Right) A topic-like relation. Both exclude
POS-SEQ and ENT-TYPE features for brevity.


leads to problems if the model does not account for syntactic overlap
of semantically distinct relations. Both of these issues could be
addressed by adding additional hierarchy to the model. For instance, a
set of global topic distributions could be added to resolve the first
problem, while relations could be grouped into higher-level clusters
governing syntactic properties to resolve the other. We believe such
modifications are likely to lead to much more robust models of
relations in text without significantly complicating inference.


4.5 Comparison to Gibbs sampling 

Since we use a minibatch size of $S = 256$ and $S' = 25$ samples to
form our estimate of the natural gradient, each iteration of SSVI
corresponds to 6400 document steps in the Gibbs chain. As a result, one
full Gibbs sweep through the corpus is equivalent to about 75 SSVI
iterations in terms of numbers of samples taken.[2] We use this as the
basis of the plot in Figure 1.
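The conversion between SSVI iterations and Gibbs sweeps follows directly from the figures reported in the text (462755 training documents, minibatches of 256 documents, 25 samples per iteration). The small helper below, whose name is hypothetical, reproduces the arithmetic.

```python
def ssvi_iterations_per_gibbs_sweep(num_train_docs, minibatch_size, samples_per_iteration):
    """Return (document steps per SSVI iteration, SSVI iterations per sweep)."""
    steps_per_iteration = minibatch_size * samples_per_iteration
    return steps_per_iteration, num_train_docs / steps_per_iteration


steps, iterations = ssvi_iterations_per_gibbs_sweep(462755, 256, 25)
# steps == 6400; iterations is 462755 / 6400, roughly 72.3
```

The same helper shows how the exchange rate shifts with other settings; for instance, a minibatch of 64 documents with the same 25 samples gives 1600 document steps per iteration.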

Surprisingly, Gibbs sampling appears to achieve better held-out
perplexity at each given level of computation. This is contrary to the
expected behavior of SSVI (Hoffman et al., 2010; Mimno et al., 2012)
and does not have a clear explanation. The most likely causes are,
first, the learning parameters, as stochastic gradient methods are
known to be extremely sensitive to the choice of learning rate
(Ranganath et al., 2013) and, second, the inherent noisiness of
stochastic gradient methods, which work best on large, highly redundant
corpora. Although we lightly optimized the learning parameters, it is
possible that more extensive experiments would discover a drastically
better setting of those parameters; alternatively, adaptive rate
methods may be needed. If, on the other hand, the cause is simply the
noisiness of the stochastic gradients, then variance reduction
techniques may yield better results (Paisley et al., 2012).

[2] A more exact number is approximately 72.3.


It is also important to account for the computational aspect of the
performance metric. Often, one can drastically reduce the size of the
minibatches in SSVI (e.g. to 64 documents), which would lead to
multiplicative speedups (e.g. 4x if $S = 64$); likewise, the number of
Gibbs sweeps used for estimation on the minibatch could be reduced, as
could the burn-in for those sweeps. Such fine-tuning is beyond the
scope of this work but, based on our results, would be a crucial
component in practical systems seeking to reap the benefits of SSVI
with models like RelLDA.


Finally, it may simply be that for a complex model like RelLDA, the
data set must be made far larger before sufficient redundancy appears,
in which case we would expect to see gains in the performance of SSVI
relative to Gibbs sampling in the regime of larger information
extraction datasets, which often contain hundreds of millions of
documents. This last point also illustrates how SSVI might be
advantageous even if less statistically efficient: unlike Gibbs
sampling, whose memory usage grows with the size of the corpus, SSVI
can operate with a fixed amount of memory, just enough to store the
minibatch data structures.


5 Conclusion 

We have shown that SSVI is a promising technique for relation
extraction at scale. Apart from some pathologies due to the modeling
assumptions, it discovers coherent relational clusters while requiring
less memory and time than sampling methods. Moreover, the issues we
uncover point to problems with the model that suggest how more
effective probabilistic models of relations in text might be designed
and used.
Appendices

In the following appendices, we explain the mathematics of our
inference algorithm in detail.

A The core algorithm


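An alternative to MAP inference via Gibbs sampling is variational
inference. Ordinarily, this would be done by specifying a variational
distribution over the $\beta$, $z$, and $\theta$ variables. In our
setup, because each observation consists of multiple words, the
standard way of doing this fails, however. Fortunately, we can use a
recent stochastic approach that still works and that scales much better
than batch variational Bayes. This approach is based on the strategy
for LDA set out in (Mimno et al., 2012).

To do this, we posit

$$q(\beta_{rf}) = \mathrm{Dir}(\lambda_{rf}), \qquad \lambda_{rf} \in \mathbb{R}_+^{W_f},$$

and let $q(z_d)$ be an arbitrary distribution, which will be chosen to
be the optimal one per the analytical (but uncomputable) variational
Bayes update formula. The mixing distributions $\theta$ are
marginalized out as in collapsed Gibbs sampling. Since our goal is to
optimize $\lambda$, we write the ELBO up to a constant independent of
$\lambda$:

$$\mathcal{L} = \sum_{r,f,v} \Big( \sum_d E_q[N_{drfv}] + \eta_f - \lambda_{rfv} \Big) E_q[\log \beta_{rfv}] + \sum_{r,f} \Big( \sum_v \log\Gamma(\lambda_{rfv}) - \log\Gamma(\lambda_{rf}) \Big),$$

where $\lambda_{rf} = \sum_v \lambda_{rfv}$ whenever $\lambda_{rf}$
appears as a scalar argument. We know
$E_q[\log \beta_{rfv}] = \Psi(\lambda_{rfv}) - \Psi(\lambda_{rf})$, and
we use sampling over $z_d$ to approximate $E_q[N_{drfv}]$.
Specifically, basic theory tells us that the optimal choice of
variational distribution over $z_d$, holding those over all other
variables fixed, is

$$
\begin{aligned}
q^*(z_d) &\propto \exp\big( E_{q(\neg z_d)}[\log p(z_{1:D}, v_{1:D}, \beta_{1:R,1:F})] \big) \\
&\propto \exp\big( E_q[\log p(v_d \mid z_d, \beta) + \log p(z_d \mid \alpha)] \big) \\
&\propto p(z_d \mid \alpha) \prod_{r,f,v} \exp\big( N_{drfv}\, E_q[\log \beta_{rfv}] \big) \\
&= \frac{\Gamma(R\alpha)}{\Gamma(O_d + R\alpha)} \prod_r \frac{\Gamma(O_{dr} + \alpha)}{\Gamma(\alpha)}
 \times \prod_r \prod_f \prod_{v : N_{drfv} \neq 0} \exp\big( N_{drfv} [\Psi(\lambda_{rfv}) - \Psi(\lambda_{rf})] \big),
\end{aligned}
$$

where $O_{dr}$ denotes the number of sentences in $d$ assigned to
relation $r$ and $O_d = \sum_r O_{dr}$. We thus find

$$q^*(z_{di} = r \mid z_d^{\setminus i}) \propto (O_{dr}^{\setminus i} + \alpha) \times \prod_f \prod_{v : N_{difv} \neq 0} \exp\big( N_{difv} [\Psi(\lambda_{rfv}) - \Psi(\lambda_{rf})] \big), \qquad (1)$$

where the superscript $\setminus i$ indicates that sentence $i$ is
excluded and $N_{difv}$ counts the occurrences of feature value $v$ of
type $f$ in sentence $i$. This means we can approximately sample from
$q^*$ using Gibbs sampling to obtain an approximation to
$E_q[N_{drfv}]$.

Why is this helpful? As shown in (Hoffman et al., 2013), the natural
gradient of $\mathcal{L}$ in the $rfv$ dimension is given by

$$E_{q(z_{1:D})}\Big[ \sum_d N_{drfv} \Big] + \eta_f - \lambda_{rfv}.$$

Split up over documents, this gives a per-document contribution of

$$E_q[N_{drfv}] + \frac{1}{D} (\eta_f - \lambda_{rfv}).$$

This means that if we sample a batch of documents $d_1, \ldots, d_S$
and approximate $E_q[N_{drfv}]$ using $S'$ rounds of Gibbs sampling, we
will end up with an unbiased estimate of the natural gradient that we
can use for stochastic gradient ascent on $\mathcal{L}$. With a little
bit more work, we can make all necessary updates sparse to ensure
efficiency.

Concretely, each iteration of the algorithm does the following.

1. Sample a minibatch $M = \{d_1, \ldots, d_S\}$ of $S$ documents
(without replacement).

2. Run $B$ burn-in rounds of Gibbs sampling on $z_d$ for $d \in M$
using (1). Then run $S'$ more sweeps, saving the value of
$\sum_{d \in M} N_{drfv}$ after each one. Estimate
$\sum_{d \in M} E_q[N_{drfv}]$ by

$$\hat{N}^{M}_{rfv} := \frac{1}{S'} \sum_{s'=1}^{S'} \sum_{d \in M} N^{(s')}_{drfv}.$$

3. Estimate the $rfv$ component of the overall natural gradient by

$$\hat{g}_{rfv} := \frac{D}{S} \cdot \hat{N}^{M}_{rfv} + \eta_f - \lambda_{rfv}.$$

4. Update

$$\lambda_{rfv} \leftarrow \lambda_{rfv} + \rho\, \hat{g}_{rfv},$$

where $\rho = \rho_t$ is the current learning rate.

Note that if we write $\hat{N}_{rfv} = \frac{D}{S} \cdot \hat{N}^{M}_{rfv}$
and $N_{rfv} = \lambda_{rfv} - \eta_f$ (this is the pseudocount part of
the variational parameter), we have $\lambda_{rfv} = N_{rfv} + \eta_f$
and hence an update of the form

$$N_{rfv} \leftarrow (1 - \rho) N_{rfv} + \rho \hat{N}_{rfv}.$$

Note further that if we let $\pi_t = \prod_{\tau=0}^{t} (1 - \rho_\tau)$,
we can write this update as

$$\frac{N^{(t)}_{rfv}}{\pi_t} \leftarrow \frac{N^{(t-1)}_{rfv}}{\pi_{t-1}} + \frac{\rho_t \hat{N}^{(t)}_{rfv}}{\pi_t}.$$

Thus, if we track $N_{rfv} / \pi_t$ rather than the raw pseudocount, we
get sparse updates. This is what the code actually does.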
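The rescaling trick can be illustrated with a small sketch. It stores the scaled pseudocount $N_{rfv} / \pi_t$ in a dictionary so that an iteration only touches keys with nonzero minibatch counts, and multiplies by $\pi_t$ on lookup. The class and method names are hypothetical, all pseudocounts are assumed to start at zero, and every learning rate is assumed to lie strictly between 0 and 1. Because $\pi_t$ shrinks toward zero, a practical implementation would presumably rescale the stored values from time to time to avoid underflow; that step is omitted here.

```python
class ScaledPseudocounts:
    """Track N / pi_t so that N <- (1 - rho) N + rho N_hat is sparse."""

    def __init__(self):
        self.scaled = {}  # key (r, f, v) -> N_{rfv} / pi_t
        self.pi = 1.0     # running product of (1 - rho) over iterations

    def update(self, n_hat, rho):
        """Apply one update; n_hat maps keys to rescaled minibatch counts.

        Only keys present in n_hat are visited; all other pseudocounts
        decay implicitly through the shared factor pi.
        """
        self.pi *= (1.0 - rho)
        for key, value in n_hat.items():
            self.scaled[key] = self.scaled.get(key, 0.0) + rho * value / self.pi

    def pseudocount(self, key):
        """Recover the raw pseudocount N_{rfv} = pi_t * (N_{rfv} / pi_t)."""
        return self.pi * self.scaled.get(key, 0.0)
```

The variational parameter is then recovered as the pseudocount plus $\eta_f$. Running this side by side with the dense recursion on a few small batches gives the same pseudocounts up to floating-point error, while the per-iteration cost depends only on the number of nonzero entries in the minibatch estimate.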

B Adding hyperparameter optimization 

In its current form, the variational inference algorithm requires the
Dirichlet hyperparameters $\eta_f$ to the global relation distributions
$\beta_{rf}$ and $\alpha$ to the local mixing distributions $\theta_d$
to be set manually. To remove this limitation, we extend the natural
gradient descent scheme to the hyperparameters.

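To begin, note that the part of the variational objective that depends
on $\eta_f$ is given by

$$\mathcal{L}(\eta_f) = \eta_f \cdot \sum_r \sum_v [\Psi(\lambda_{rfv}) - \Psi(\lambda_{rf})] - R \cdot [W_f \log\Gamma(\eta_f) - \log\Gamma(W_f \eta_f)],$$

whence

$$\frac{\partial \mathcal{L}}{\partial \eta_f} = \sum_r \Big\{ \sum_v \Psi(\lambda_{rfv}) - W_f \cdot [\Psi(\lambda_{rf}) + \Psi(\eta_f) - \Psi(W_f \eta_f)] \Big\}.$$

However, we would like to use natural gradient updates, which have the
form

$$\eta_f^{(t+1)} = \eta_f^{(t)} + \rho_t\, G_{\eta_f}^{-1} \nabla_{\eta_f} \mathcal{L},$$

where $G_{\eta_f} = E\big[ \big( \tfrac{\partial}{\partial \eta_f} \log p(\beta_f \mid \eta_f) \big)^2 \big]$
is the Fisher information matrix for the parameter $\eta_f$ evaluated
at the value $\eta_f^{(t)}$. Since

$$\log p(\beta_f \mid \eta_f) = (\eta_f - 1) \cdot \sum_{r,v} \log \beta_{rfv} - R\,(W_f \log\Gamma(\eta_f) - \log\Gamma(W_f \eta_f)),$$

this is easy to compute.

Indeed, if we write
$\log p(\beta_f \mid \eta_f) = t(\beta_f) \cdot \eta_f - t(\beta_f) - a(\eta_f)$
with $t(\beta_f) = \sum_r \sum_v \log \beta_{rfv}$, we need only compute
$E[(t(\beta_f) - a'(\eta_f))^2]$, which, by the usual exponential family
identities, is given by

$$E[t(\beta_f)^2] - E[t(\beta_f)]^2 = a''(\eta_f).$$

Fortunately, we know

$$a'(\eta_f) = R W_f \cdot [\Psi(\eta_f) - \Psi(W_f \eta_f)],$$

so we can calculate

$$G_{\eta_f}(\eta_f) = a''(\eta_f) = R W_f\, [\Psi_1(\eta_f) - W_f \Psi_1(W_f \eta_f)],$$

where $\Psi_1 = \Psi'$ is the first polygamma function (the trigamma
function). Note that, analogously,

$$G_\alpha(\alpha) = D R \cdot [\Psi_1(\alpha) - R\, \Psi_1(R\alpha)].$$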
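The natural gradient step for $\eta_f$ can be sketched numerically as follows. This is an illustration rather than the implementation used in the experiments: the digamma and trigamma functions are approximated with a textbook recurrence plus asymptotic series so that the block needs only the standard library, the function names are hypothetical, and no safeguard keeps $\eta_f$ positive after the step.

```python
import math


def digamma(x):
    """Psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Psi_1(x) for x > 0, by the same shift-and-series approach."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + inv + 0.5 * inv2 + series


def eta_gradient(lam_f, eta_f):
    """Ordinary gradient of the objective in eta_f.

    lam_f: one row per relation r holding lambda_{rfv} for every v.
    """
    R, W = len(lam_f), len(lam_f[0])
    data_term = 0.0
    for row in lam_f:
        psi_total = digamma(sum(row))
        data_term += sum(digamma(x) - psi_total for x in row)
    return data_term - R * W * (digamma(eta_f) - digamma(W * eta_f))


def eta_fisher(R, W, eta_f):
    """Fisher information R W [Psi_1(eta) - W Psi_1(W eta)]."""
    return R * W * (trigamma(eta_f) - W * trigamma(W * eta_f))


def natural_gradient_step(lam_f, eta_f, rho):
    """eta <- eta + rho * gradient / Fisher information."""
    R, W = len(lam_f), len(lam_f[0])
    return eta_f + rho * eta_gradient(lam_f, eta_f) / eta_fisher(R, W, eta_f)
```

Dividing by the Fisher information rescales the step by the local curvature of the Dirichlet prior, so the same learning rate schedule can be shared with the $\lambda$ updates.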

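The (unnatural) gradient for $\alpha$ is harder to compute, however:

$$\frac{\partial \mathcal{L}}{\partial \alpha} = \sum_d \frac{\partial}{\partial \alpha} E_q[\log p(z_d \mid \alpha)] = \sum_d E_q\Big[ \frac{\partial}{\partial \alpha} \log p(z_d \mid \alpha) \Big].$$

Since the expectation cannot be analytically computed, we use our
samples $z_d^{(s')}$ for $s' = 1, \ldots, S'$ and $d \in M$ to compute
a stochastic gradient. For this, we first note that for fixed $z_d$,

$$\log p(z_d \mid \alpha) = \sum_r [\log\Gamma(O_{dr} + \alpha) - \log\Gamma(\alpha)] + \log\Gamma(R\alpha) - \log\Gamma(O_d + R\alpha),$$

whence

$$\frac{\partial}{\partial \alpha} \log p(z_d \mid \alpha) = \sum_r [\Psi(O_{dr} + \alpha) - \Psi(\alpha)] + R \cdot [\Psi(R\alpha) - \Psi(O_d + R\alpha)] =: g_\alpha(z_d),$$

where $O_{dr}$ denotes the number of sentences in $d$ assigned to
relation $r$ and $O_d = \sum_r O_{dr}$. We thus obtain a stochastic
gradient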